Add IaaS network provider integration support #5573
Codecov Report
❌ Your patch check has failed because the patch coverage (25.00%) is below the target coverage (60.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #5573 +/- ##
==========================================
+ Coverage 67.70% 67.92% +0.21%
==========================================
Files 61 61
Lines 6360 6344 -16
==========================================
+ Hits 4306 4309 +3
+ Misses 1770 1756 -14
+ Partials 284 279 -5
if i.config.IaaSClient != nil {
	logger.Debug("Calling IaaS provider to allocate IPs", zap.String("nic", *addArgs.IfName))
	if _, iaasErr := i.callIaaSAllocate(ctx, pod, results); iaasErr != nil {
		logger.Error("IaaS allocate failed, continuing without IaaS allocation", zap.Error(iaasErr))
The word "continuing" may be confusing here. Alternatively, we could use:
`IaaS allocate failed, aborting IPAM allocation`
* add IaaS provider configuration parameters to values.yaml and README
* mount IaaS TLS secret to controller and agent pods
* add IaaS provider config to configmap
* implement IaaS client with mTLS support for allocate/release IP operations
* add IaaS config validation in controller and agent daemon startup
* integrate IaaS client into IPAM workflow

Signed-off-by: Cyclinder Kuo <qifeng.guo@daocloud.io>
* add VLAN_VERSION build argument and environment variable
* clone vlan-cni repository and build vlan binary
* copy vlan binary to release image at /usr/plugins/vlan
* add --install-vlan flag to entrypoint.sh for conditional installation
* add vlan plugin installation logic with version logging
* export VLAN_VERSION in GitHub Actions workflow outputs
* set default VLAN_VERSION to 0.0.1 in version.sh

Signed-off-by: Cyclinder Kuo <qifeng.guo@daocloud.io>
This should come with a docs document that explains how to integrate with a generic iaasNetworkProvider. It may not need a demo, but it should explain some of the principles:
Done, added the document.
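The allocate API is on the Pod creation path. The Provider should return success only after the IP binding on the IaaS side has completed and the returned attributes are ready for Spiderpool to use.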
If the Provider cannot complete the IP binding, it must return a non-`2xx` status code. Spiderpool treats that allocation as failed.
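How should the following failure scenarios be handled?
- The provider is rate limited, so processing takes a long time and the call times out before any HTTP response arrives.
- The provider itself has failed and does not respond.
- The provider simply cannot get the IaaS side to bind the IP, no matter what.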
iaasNetworkProvider:
  serverUrl: "http://iaas-network-provider.iaas-network-provider-system.svc:80"
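- The implicit assumption behind the whole flow seems to be that the vlan CNI must be used, but the configuration steps do not mention it. Are there any particular requirements for the related Multus configuration?
- The document does not explain that the Spiderpool IP pool and the IaaS VPC subnet must both be created manually, one on each side, and that their CIDRs must match.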
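The CIDR consistency requirement raised in the second point could be checked with something like the sketch below. It is only an illustration: `sameCIDR` is a hypothetical helper, not part of Spiderpool, and the example subnets are made up. It compares the pool subnet with the VPC subnet after normalizing both with the standard `net/netip` package.

```go
package main

import (
	"fmt"
	"net/netip"
)

// sameCIDR reports whether two CIDR strings describe the same network.
// Hypothetical helper for illustration only.
func sameCIDR(poolCIDR, vpcCIDR string) (bool, error) {
	a, err := netip.ParsePrefix(poolCIDR)
	if err != nil {
		return false, fmt.Errorf("parse pool CIDR: %w", err)
	}
	b, err := netip.ParsePrefix(vpcCIDR)
	if err != nil {
		return false, fmt.Errorf("parse VPC subnet CIDR: %w", err)
	}
	// Masked() zeroes the host bits, so 172.16.0.5/16 and 172.16.0.0/16 compare equal.
	return a.Masked() == b.Masked(), nil
}

func main() {
	same, err := sameCIDR("172.16.0.0/16", "172.16.0.0/16")
	fmt.Println(same, err)
	same, err = sameCIDR("172.16.0.0/16", "172.16.0.0/24")
	fmt.Println(same, err)
}
```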
- enp0s28
ippools:
ipv4:
- pool-enp0s28
Shouldn't the gateway connectivity check be turned off here? It serves no purpose in this setup.
* move netlink link setup logic from command_add.go to ipam_detection.go
* set link up inside DetectIPConflictAndGatewayReachable before detection
* change early return to continue when IP version is nil during detection
* remove netlink import from command_add.go

Signed-off-by: Cyclinder Kuo <qifeng.guo@daocloud.io>
* bump version to v1.2.0-rc1 (#5583)

* robot updates the release version of the README file based on the release tag: v1.1.2 (#5592)

  Signed-off-by: weizhoublue <weizhou.lan@daocloud.io>
  Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

* robot updates the release version of the README file based on the release tag: v1.0.6 (#5588)

  Signed-off-by: weizhoublue <weizhou.lan@daocloud.io>
  Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>
  Co-authored-by: Cyclinder <qifeng.guo@daocloud.io>

* fix: make Calico installation compatible with latest Calico and older Kubernetes versions (#5610)

  * Initial plan

  * fix: replace immutable Calico IPPool CIDR patch with delete+recreate

    In newer Calico versions (v3.30+), spec.cidr is immutable and cannot be changed via kubectl patch. The CI was failing with:

    The IPPool "default-ipv4-ippool" is invalid: spec.cidr: Invalid value: "string": CIDR cannot be changed; follow IP pool migration guide to avoid corruption.

    Replace the kubectl patch approach with a replace_ippool_cidr() helper function that:
    1. Exports the current IPPool YAML
    2. Strips server-managed fields
    3. Updates spec.cidr to the desired value
    4. Validates the new YAML with --dry-run=client before destructive ops
    5. Deletes the existing pool
    6. Recreates it with the updated YAML

    Agent-Logs-Url: https://github.com/spidernet-io/spiderpool/sessions/59cffb22-cbbc-4471-a713-16f925218051
    Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

  * fix: strip x-kubernetes-validations from Calico CRDs for older K8s compatibility

    Calico v3.32.0 CRDs use x-kubernetes-validations with:
    - CEL functions like isCIDR() that require Kubernetes 1.29+
    - Fields like 'reason'/'messageExpression' that require K8s 1.28+

    These cause strict decoding errors and CEL evaluation failures when applying Calico CRDs on older K8s versions (e.g. K8s v1.27.1) used in the test matrix.

    Strip all x-kubernetes-validations from the downloaded Calico YAML before applying it. This only removes CRD input validation hints; it does not affect Calico's core networking functionality.

    Agent-Logs-Url: https://github.com/spidernet-io/spiderpool/sessions/d8d45c5c-363b-499a-b09f-0cf90ec8d056
    Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

  ---------

  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

* CI: Fix kdoctor connection issue (#5632)

* Stabilize Nightly K8s Matrix E2E cleanup by making IPPool deletion idempotent (#5636)

  * Initial plan

  * test(e2e): ignore notfound when deleting ippool by name

    Agent-Logs-Url: https://github.com/spidernet-io/spiderpool/sessions/f9f1436f-70c0-4d93-8f46-ad7753b01d71
    Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

  * chore: finalize nightly k8s matrix ci fix status

    Agent-Logs-Url: https://github.com/spidernet-io/spiderpool/sessions/f9f1436f-70c0-4d93-8f46-ad7753b01d71
    Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

  * chore: remove accidental local build artifacts

    Agent-Logs-Url: https://github.com/spidernet-io/spiderpool/sessions/f9f1436f-70c0-4d93-8f46-ad7753b01d71
    Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

  ---------

  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

* Stabilize subnet e2e IPPool record assertions in nightly IPv4 run (#5637)

  * Initial plan

  * test(e2e): stabilize subnet ippool record checks with retries

    Agent-Logs-Url: https://github.com/spidernet-io/spiderpool/sessions/89afecd4-2b4f-400f-bcb7-8930fa84ff30
    Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

  * chore: revert unintended binary build artifacts

    Agent-Logs-Url: https://github.com/spidernet-io/spiderpool/sessions/89afecd4-2b4f-400f-bcb7-8930fa84ff30
    Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

  ---------

  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: cyclinder <59680092+cyclinder@users.noreply.github.com>

* Stabilize annotation E2E against transient namespace visibility race (#5642)

  * Initial plan
  * test(e2e): retry pod creation when namespace cache not ready

  ---------

  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

* Stabilize KubeVirt install in nightly e2e by making virt-operator rollout timeout configurable (#5641)

  * Initial plan
  * test: make kubevirt rollout timeout configurable for CI stability
  * chore: remove unintended build artifacts

  ---------

  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

* coordinator: detect SLAAC v6 on iface when PrevResult.IPs lacks v6 (#5619)

  * coordinator: detect SLAAC v6 via iface scan when PrevResult.IPs lacks v6

    PrevResult.IPs is populated only by upstream IPAMs. v6 addresses obtained via kernel SLAAC (RA-driven autoconf) never appear there even though they are configured on the macvlan child, which leaves ipFamily stuck at FAMILY_V4 and silently skips every v6 path in the coordinator: host /128 routes, hijack-route gateway, and sysctl reconciliation.

    Add GetIPFamilyByResultWithIface that wraps the existing pure parser and falls back to scanning the pod iface (in args.Netns) for non-link-local v6 addresses. The coordinator opens the pod netns up front and uses the new helper, so ipFamily resolves to FAMILY_ALL for the common "DHCPv4 IPAM + kernel SLAAC v6" macvlan setup.

    CmdAdd logs at Info when the fallback fires so operators can see why the family was upgraded. The pure GetIPFamilyByResult is kept unchanged for callers without a netns.

    New unit tests cover the parsing paths and every branch of the fallback (iface scan mocked via gomonkey).

    Fixes #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  * coordinator: filter tentative/DAD-failed and reserved v6 addresses in getAdders

    RFC 4862 §5.4 ("Duplicate Address Detection") says a tentative address is not considered assigned to an interface, and §5.4.5 says a DAD-failed address MUST NOT be assigned. Without filtering these in getAdders, the SLAAC v6 fallback path could install host /128 routes and NUD_PERMANENT neighbor entries against an address that the kernel removes moments later - observable in a small window during CNI ADD on busy macvlan parents (vishvananda/netlink.AddrList does not flag-filter).

    While here, also drop loopback (::1) and unspecified (::) - RFC 4291 §2.5.2/§2.5.3 reserved, IANA Source=False/Destination=False. They're excluded by accident today because all current call sites filter "^lo$", but adding them to the in-place predicate matches Calico's host-iface filter and protects future refactors. Zero behavior change on the documented call paths.

    Verified against real Linux DAD on a veth pair (local integration test): tentative SLAAC v6 is correctly filtered until DAD completes (~1-2s), then picked up.

    Refs #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  * networking: add real-netns integration tests for SLAAC v6 fallback

    Mirrors the test pattern in cmd/spiderpool/cmd/command_test.go: standard ginkgo _test.go (no build tag), label "unittest" so it's picked up by make unittest-tests, real netns via testutils.NewNS, real interfaces via netlink.LinkAdd.

    Covers the same scenarios as the mocked unit tests plus one the mocks can't exercise: a SLAAC v6 added without IFA_F_NODAD on a veth pair - verifies the IFA_F_TENTATIVE filter ignores the address until kernel DAD completes, then picks it up (RFC 4862 §5.4).

    Lives in its own ginkgo suite (pkg/networking/networking/integration) rather than the same _test.go as the mocked unit tests. testutils.NewNS intentionally does not UnlockOSThread (containernetworking/plugins testutils/netns_linux.go:116-118 - Go ≥1.10 kills the locked thread when the goroutine exits), and the resulting OS-thread churn was making gomonkey machine-code patches in sibling specs flaky depending on ginkgo's spec randomization. Separate test binaries fully isolate the two test styles; verified stable across 10 consecutive seeded runs.

    Refs #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  * coordinator: actively solicit an RA when SLAAC v6 hasn't arrived yet

    The passive iface scan in #5618's first commit only helps setups where the SLAAC v6 is already on the iface at CNI-ADD time. For the common case where it isn't yet - multus chains where macvlan brings the iface up before tuning sets accept_ra=2, combined with the pod inheriting net.ipv4.ip_forward=1 (k3s default), so the kernel skips the link-up RS and ignores any periodic RA at accept_ra=1 (net/ipv6/addrconf.c ipv6_accept_ra: forwarding=1 + accept_ra<2 returns false) - the passive fix is a no-op.

    Add SolicitRouterAndWaitForSLAACv6: sends an ICMPv6 Router Solicitation to ff02::2 from inside the pod netns and polls AddrList until a usable non-link-local v6 appears (or timeout). Two-phase wait under one timeout budget:
    1. Wait for a non-tentative link-local (RFC 4862 §5.4: kernel refuses to source from a tentative address - observable as EADDRNOTAVAIL on sendmsg) - typically <1.5s.
    2. Send the RS (RFC 4861 §4.1, §6.3.5) and poll for the SLAAC GUA.

    CmdAdd now resolves ipFamily in three layers: PrevResult.IPs → passive iface scan → router-solicited active scan. The active path is only exercised when the prior two come up v4-only, so the latency cost is paid only when the fix is the user's last chance for v6 routing.

    Integration tests cover: empty-input error, missing-iface error, no-router timeout-without-error, and the early-return when a v6 is already present (regression case for setups that don't need the active path).

    Refs #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  * coordinator: gate the RS solicit on the kernel's accept_ra/forwarding state

    Address two style-review nits and prevent a v4-only regression.

    Gate: SolicitRouterAndWaitForSLAACv6 now reads /proc/sys/net/ipv6/conf/<iface>/{accept_ra,forwarding} and skips the RS unless the kernel's ipv6_accept_ra predicate (net/ipv6/addrconf.c) would process the elicited RA. Before this, v4-only setups paid the full ~3s solicit timeout on every CmdAdd despite the kernel ignoring the response.

    GoDoc on SolicitRouterAndWaitForSLAACv6 trimmed from a 37-line RFC-cited block to ~10 lines, matching the density of sibling exported helpers in pkg/networking/networking/packet.go. The longer-form rationale lives in the PR description and #5618.

    integration/doc.go expanded with the gomonkey + testutils.NewNS() OS-thread-leak rationale so reviewers see why this subpackage exists (no precedent for `<pkg>/integration` in the repo otherwise).

    Refs #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  * coordinator: clarify the kernel-drops-RA wording, note caller contract, test the gate

    Three small follow-ups from review:
    - Reword the SolicitRouterAndWaitForSLAACv6 docstring to say the kernel drops (not ignores) the early periodic RAs, and explicitly state the caller contract: accept_ra must be permissive on the iface before the function is invoked (typically via the upstream tuning plugin earlier in the NAD), otherwise the function exits without sending an RS.
    - Add an integration test that exercises the gate's block path (accept_ra=1 + forwarding=1 → returns before Phase 1's first poll). Wraps sysctl writes in a Skip() guard so the test runs in CI (sudo on a real Linux host) but cleanly skips on contributor laptops where /proc/sys is read-only.

    Refs #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  * coordinator: revert getAdders filter; drop active RS solicit per review

    Two fixes rolled together because they touch adjacent code:

    1. Revert the IFA_F_TENTATIVE/IFA_F_DADFAILED/IsLoopback/IsUnspecified filters in getAdders. The TENTATIVE filter broke setupNeighborhood's stale-entry cleanup - c.currentAddress now drops the pod's v6 while DAD is in progress, so the pre-populated stale neighbor on the host is never matched and the entry isn't deleted. Observable as e2e MacvlanOverlayOne / "auto clean up the dirty rules" timing out on dual / ipv6 matrix slots.

       The standards-review reasoning that "tentative is not assigned per RFC 4862 §5.4" is technically correct, but the rest of the coordinator codebase intentionally uses tentative addresses for neighbor / route bookkeeping (the pod is assigned that IP even if DAD hasn't completed). Filtering at the getAdders layer broke that contract.

    2. Drop SolicitRouterAndWaitForSLAACv6 + defaultSLAACSolicitTimeout + the active-path call site in CmdAdd per @cyclinder's review on #5619. Coordinator now only does the passive iface scan when PrevResult.IPs lacks v6. Comment #1 (drop the prevResultFamily re-parse and the familyDiscovery tracking) folded in.

    The active RS remains the right fix for setups hitting the macvlan→tuning accept_ra-flip-after-link-up timing race where the kernel drops the early periodic RAs and won't auto-resend RS. Revisitable as a separate CNI plugin or follow-up PR after this lands.

    Refs #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  * networking/integration: dedupe redundant DescribeTable entry

    The previous "v4-only IPAM, no v6 on iface" and "v4 IPAM + only link-local v6" entries had identical inputs (prev=[v4], ifaceV4=[v4], ifaceV6=nil) and the same expected output (FAMILY_V4). The kernel auto-assigns fe80:: on every dummy iface at link-up regardless of what we add, so the "LL filtered" semantic was already covered by the other entry. Collapsed into a single entry with the explanatory description.

    Refs #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  * coordinator: read iface for ipFamily, drop PrevResult parsing path

    Per @cyclinder's review: scanning the iface in the pod netns is the universal way to determine ipFamily - works for any IPAM and for kernel SLAAC alike, simpler than the parse-PrevResult-then-fallback machinery this PR previously added.

    - Add GetIPFamilyByIface(netns, ifName) in pkg/networking/networking/ip.go.
    - Remove GetIPFamilyByResultWithIface and the fallback branches it had.
    - command_add.go calls the new helper directly; PrevResult parsing in CmdAdd is now just a shape-validation pass.
    - GetIPFamilyByResult (pre-existing pure parser) is untouched for backward compat.

    Unit tests for GetIPFamilyByIface narrowed to input validation (nil netns, empty ifName) since the behavioral coverage (v4 / v6 / dual / no-addrs / missing-iface) is exercised against real netlink in the integration suite; gomonkey-based behavioral mocking proved flaky.

    Refs #5618

    Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

  ---------

  Signed-off-by: Kevin Park <kevin.park1217@gmail.com>

* Add IaaS network provider integration support (#5573)

  * Add IaaS network provider integration support

    * add IaaS provider configuration parameters to values.yaml and README
    * mount IaaS TLS secret to controller and agent pods
    * add IaaS provider config to configmap
    * implement IaaS client with mTLS support for allocate/release IP operations
    * add IaaS config validation in controller and agent daemon startup
    * integrate IaaS client into IPAM workflow

    Signed-off-by: Cyclinder Kuo <qifeng.guo@daocloud.io>

  * Add vlan-cni plugin support to spiderpool-plugins image

    * add VLAN_VERSION build argument and environment variable
    * clone vlan-cni repository and build vlan binary
    * copy vlan binary to release image at /usr/plugins/vlan
    * add --install-vlan flag to entrypoint.sh for conditional installation
    * add vlan plugin installation logic with version logging
    * export VLAN_VERSION in GitHub Actions workflow outputs
    * set default VLAN_VERSION to 0.0.1 in version.sh

    Signed-off-by: Cyclinder Kuo <qifeng.guo@daocloud.io>

  * fix: move link setup from CmdAdd to DetectIPConflictAndGatewayReachable

    * move netlink link setup logic from command_add.go to ipam_detection.go
    * set link up inside DetectIPConflictAndGatewayReachable before detection
    * change early return to continue when IP version is nil during detection
    * remove netlink import from command_add.go

    Signed-off-by: Cyclinder Kuo <qifeng.guo@daocloud.io>

  ---------

  Signed-off-by: Cyclinder Kuo <qifeng.guo@daocloud.io>

* bump version to v1.2.0

---------

Signed-off-by: weizhoublue <weizhou.lan@daocloud.io>
Signed-off-by: Kevin Park <kevin.park1217@gmail.com>
Signed-off-by: Cyclinder Kuo <qifeng.guo@daocloud.io>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Kevin Park <kevin.park1217@gmail.com>